Attorney's Docket No.: 10559-642001 / P12486 



APPLICATION 
FOR 

UNITED STATES LETTERS PATENT 



TITLE: MICROINSTRUCTION POINTER STACK IN A 

PROCESSOR 

APPLICANT: MICHAEL P. CORNABY AND BEN CHAFFIN 



Scott C. Harris 

Fish & Richardson P.C. 

4350 La Jolla Village Drive 

Suite 500 

San Diego, CA 92122 
Telephone: 858-678-5070 
Facsimile: 858-678-5099 



CERTIFICATE OF MAILING BY EXPRESS MAIL 

Express Mail Label No. EU 047 039 083 US 

I hereby certify that this correspondence is 
being deposited with the United States Postal 
Service as Express Mail Post Office to Addressee 
with sufficient postage on the date indicated 
below and is addressed to the Commissioner for 
Patents, Washington, D.C. 20231. 




Typed or Printed Name of Person Signing 
Certificate 




MICROINSTRUCTION POINTER STACK IN A 

PROCESSOR 

TECHNICAL FIELD 

This invention relates to a microinstruction pointer stack 
in a processor. 

BACKGROUND 

A microprocessor is a computer processor on a microchip. 
The microprocessor is typically designed to perform arithmetic 
and logic operations that make use of small number-holding areas 
called registers. Typical microprocessor operations include 
adding, subtracting, comparing, and fetching operands from 
memory or registers. These operations result from execution of a 
set of instructions that comprise a program. The set of 
instructions is part of the microprocessor design. 

DESCRIPTION OF DRAWINGS 

FIG. 1 is a block diagram of a processor. 

FIG. 2 is a block diagram of an execution environment of 
the processor of FIG. 1. 

FIG. 3 is a diagram of an out of order microinstruction 
pointer stack implemented in the out of order execution core of 
FIG. 1. 

FIG. 4 is a flow diagram of a µIP stack process. 

DETAILED DESCRIPTION 

Referring to FIG. 1, a processor 10 is shown. The processor 
10 is a three-way superscalar, pipelined architecture. Using 
parallel processing techniques, the processor 10 decodes, 
dispatches, and completes execution of (retires) three 
instructions per clock cycle. To handle this level of 
instruction throughput, the processor 10 uses a decoupled, e.g., 
twelve stage, pipeline that supports out of order instruction 
execution. The pipeline of the processor 10 is divided into 
five sections, i.e., a first level cache 12, a second level 
cache 14, a front end 16, an out of order execution core 18, and 
a retirement section 20. Instructions and data are supplied to 
these units through a bus interface unit 22 that interfaces with 
a system bus 24. The front end 16 supplies instructions in 
program order to the out of order execution core 18, which has 
very high execution bandwidth and can execute basic integer 
operations with one-half clock cycle latency. The front end 16 
fetches and decodes instructions into simple operations called 
micro-ops (µops). The front end 16 can issue multiple µops per 
cycle, in original program order, to the out of order execution 
core 18. The front end 16 performs several basic functions. 
For example, the front end 16 prefetches instructions 
that are likely to be executed. The front end 16 decodes 
instructions into micro operations and generates microcode for 
complex instructions, delivers decoded instructions from an 
execution trace cache 26, and predicts branches using advanced 
algorithms in a branch prediction unit 28. 

The front end 16 of the processor 10 addresses some common 
problems in high speed, pipelined microprocessors. Two of these 
problems, for example, are major sources of delay. 
These problems are the time needed to decode instructions fetched 
from the target and the decode time wasted because of 
branches or branch targets that occur in the middle of cache 
lines. 

The execution trace cache 26 addresses both of these issues 
by storing decoded instructions. Instructions are fetched and 
decoded by a translation engine (not shown) and built into 
sequences of µops called traces. These traces of µops are 
stored in the trace cache 26. The instructions from the most 
likely target of a branch immediately follow the branch, without 
regard for continuity of instruction addresses. Once a trace is 
built, the trace cache 26 is searched for the instruction that 
follows that trace. If that instruction appears as the first 
instruction in an existing trace, the fetch and decode of 
instructions 30 from the memory hierarchy ceases and the trace 
cache 26 becomes the new source of instructions. 

The execution trace cache 26 and the translation engine 
(not shown) have cooperating branch prediction hardware. Branch 
targets are predicted based on their linear addresses using 
Branch Target Buffers (BTBs) 28 and fetched as soon as possible. 
The branch targets are fetched from the trace cache 26 if they 
are indeed cached there; otherwise, they are fetched from the 
memory hierarchy. The translation engine's branch prediction 
information is used to form traces along the most likely paths. 

The core 18 executes instructions out of order, enabling the 
processor 10 to reorder instructions so that if one µop is 
delayed while waiting for data or a contended execution 
resource, other µops that are later in program order may 
execute before the delayed µop. The processor 10 employs 
several buffers to smooth the flow of µops. This implies that 
when one portion of the pipeline experiences a delay, that delay 
may be covered by other operations executing in parallel or by 
the execution of µops which were previously queued up in one of 
the buffers. 

The core 18 is designed to facilitate parallel execution. 
The core 18 can dispatch up to six µops per cycle; note that 
this exceeds the trace cache 26 and retirement 20 µop bandwidth. 
Most pipelines can start executing a new µop every cycle, so 
that several instructions can be in process at any time in each 
pipeline. A number of arithmetic logic unit (ALU) 
instructions can start two per cycle, and many floating point 
instructions can start one every two cycles. Finally, µops can 
begin execution, out of order, as soon as their data inputs are 
ready and resources are available. 

The out of order execution core 18 includes an out of order 
microinstruction pointer (µIP) stack 100. In general, a stack is 
a data area or buffer used for storing requests that need to be 
handled. A stack is typically a push-down list, meaning that as 
new requests come into the stack, the stack pushes down older 
requests. Another way of looking at a push-down list, or stack, 
is that a program usually takes its next item to handle from 
the top of the stack, unlike other arrangements such as "FIFO" 
or "first-in first-out" buffers. The stack 100 is implemented 
in a microcode environment. This allows fast subroutine returns 
in microcode. It also allows fast assist returns in microcode. 

The µIP stack 100 is different from a macroinstruction 
stack in several ways. For example, the µIP stack 100 is not 
visible from a system level (i.e., the µIP stack 100 cannot be 
directly manipulated from macrocode). The µIP stack 100 is an 
out-of-order stack where values are placed on the stack and 
removed from the stack before it is known if the sequence of 
operations was valid. Thus, a set of recovery mechanisms is 
required to correct a sequence of operations that is later 
determined to be incorrect. The process of manipulating the 
stack (PUSH, POP, etc.) is very different from a traditional 
macroinstruction stack because of the out-of-order nature of the 
stack 100. 

The µIP stack 100 provides a mechanism for improving the 
performance of microcode (µcode) execution. Microcode is 
programming that is ordinarily not program-addressable but, 
unlike hardwired logic, is capable of being modified. Microcode 
may sometimes be installed or modified by a device's user by 
altering programmable read-only memory (PROM) or erasable 
programmable read-only memory (EPROM). 

The µIP stack 100 provides a lower-overhead ability to jump 
to various subroutines and use "assists" to efficiently 
accomplish µcode functions. The µIP stack 100 has significant 
performance and µcode efficiency implications that permeate 
numerous instructions. For example, use of the µIP stack 100 
improves performance by removing indirect µcode jumps and allows 
µcode to share routines more easily by removing subroutine 
penalties. 

The retirement section 20 receives the results of the 
executed µops from the execution core 18 and processes the 
results so that the proper architectural state is updated 
according to the original program order. For semantically 
correct execution, the results of instructions are committed in 
original program order before the instructions are retired. 
Exceptions may be raised as instructions are retired. Thus, 
exceptions do not occur speculatively, but rather exceptions 
occur in the correct order, and the processor 10 can be 
correctly restarted after execution. 

When a µop completes and writes its result to the 
destination, it is retired. Up to three µops may be retired per 
cycle. A ReOrder Buffer (ROB) (not shown) in the retirement 
section 20 is the unit in the processor 10 which buffers 
completed µops, updates the architectural state in order, and 
manages the ordering of exceptions. 

The retirement section 20 also keeps track of branches and 
sends updated branch target information to the BTB 28 to update 
branch history. In this manner, traces that are no longer 
needed can be purged from the trace cache 26 and new branch 
paths can be fetched, based on updated branch history 
information. 

Referring to FIG. 2, an execution environment 50 is shown. 
Any program or task running on the processor 10 (of FIG. 1) is 
given a set of resources for executing instructions and for 
storing code, data, and state information. These resources make 
up the execution environment 50 for the processor 10. 
Application programs and the operating system or executive 
running on the processor 10 use the execution environment 50 
jointly. The execution environment 50 includes basic program 
execution registers 52, an address space 54, Floating Point Unit 
(FPU) registers 56, multimedia extension (MMX) registers 58, and 
SIMD extension registers 60. 

Any task or program running on the processor 10 can address 
a linear address space 54 of up to four gigabytes (2^32 bytes) and 
a physical address space of up to 64 gigabytes (2^36 bytes). The 
address space 54 can be flat or segmented. Using a physical 
address extension mechanism, a physical address space of 2^36 - 1 
can be addressed. 
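
By way of illustration only, the relationship between address width and address space size stated above (32 bits for four gigabytes, 36 bits for 64 gigabytes) may be checked with the following C sketch. The function names are hypothetical and are not part of the processor 10; one gigabyte is taken to be 2^30 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes addressable with the given number of address bits
 * (valid for widths below 64). */
static uint64_t address_space_bytes(unsigned address_bits)
{
    return (uint64_t)1 << address_bits;
}

/* Convert a byte count to whole gigabytes, with 1 GB = 2^30 bytes. */
static uint64_t bytes_to_gigabytes(uint64_t bytes)
{
    return bytes >> 30;
}
```

With these helpers, a 32-bit linear address yields four gigabytes and a 36-bit physical address yields 64 gigabytes.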

The basic program execution registers 52 include eight 
general purpose registers 62, six segment registers 64, an 
EFLAGS register 66, and an EIP (instruction pointer) register 
68. The basic program execution registers 52 provide a basic 
execution environment in which to execute a set of general 
purpose instructions. These instructions perform basic integer 
arithmetic on byte, word, and doubleword integers, handle 
program flow control, operate on bit and byte strings, and 
address memory. 

The FPU registers 56 include eight FPU data registers 70, 
an FPU control register 72, a status register 74, an FPU 
instruction pointer register 76, an FPU operand (data) pointer 
register 78, an FPU tag register 80 and an FPU opcode register 
82. The FPU registers 56 provide an execution environment for 
operating on single precision, double precision, and double 
extended precision floating point values, word, doubleword, and 
quadword integers, and binary coded decimal (BCD) values. 

The eight multimedia extension registers 58 support 
execution of single instruction, multiple data (SIMD) operations 
on 64-bit packed byte, word, and doubleword integers. 

The SIMD extension registers 60 include eight extended 
multimedia (XMM) data registers 84 and an MXCSR register 86. 
The SIMD extension registers 60 support execution of SIMD 
operations on 128-bit packed single precision and double 
precision floating point values and on 128-bit packed byte, 
word, doubleword and quadword integers. 

A stack (not shown) supports procedure or subroutine calls 
and the passing of parameters between procedures or subroutines. 

The general purpose registers 62 are available for storing 
operands and pointers. The segment registers 64 hold up to six 
segment selectors. The EFLAGS (program status and control) 
register 66 reports on the status of a program being executed 
and allows limited (application program level) control of the 
processor. The EIP (instruction pointer) register 68 has a 
32-bit pointer to the next instruction to be executed. 

The 32-bit general purpose registers 62 are provided for 
holding operands for logical and arithmetic operations, operands 
for address calculations, and memory pointers. The segment 
registers 64 hold 16-bit segment selectors. A segment selector 
is a special pointer that identifies a segment in memory. To 
access a particular segment in memory, the segment selector for 
that segment must be present in the appropriate segment register 
64. 

When writing application code, programmers generally 
produce segment selectors with assembler directives and symbols. 
The assembler and other tools generate the actual segment 
selector values associated with these directives and symbols. 
If writing system code, programmers may need to generate segment 
selectors directly. 

How segment registers 64 are used depends on the type of 
memory management model that the operating system or executive 
is using. When using a flat (unsegmented) memory model, the 
segment registers 64 are loaded with segment selectors that 
point to overlapping segments, each of which begins at address 
zero in the linear address space. These overlapping segments 
also include the linear address space for the program. 
Typically, two overlapping segments are defined: one for code 
and another for data and stacks. The CS segment register (not 
shown) of the segment registers 64 points to the code segment 
and all other segment registers point to the data and stack 
segment. 

When using a segmented memory model, each segment register 
64 is ordinarily loaded with a different segment selector so 
that each segment register 64 points to a different segment 
within the linear address space. At any time, a program can 
thus access up to six segments in the linear address space. To 
access a segment not pointed to by one of the segment registers 
64, a program first loads the selector of the segment to be 
accessed into a segment register 64. 

The 32-bit EFLAGS register 66 has a group of status flags, 
a control flag, and a group of system flags. Some of the flags 
in the EFLAGS register 66 can be modified directly, using 
special purpose instructions. The following instructions can be 
used to move groups of flags to and from the procedure stack or 
a general purpose register: LAHF, SAHF, PUSHF, PUSHFD, POPF, 
and POPFD. After the contents of the EFLAGS register 66 have been 
transferred to the procedure stack or a general purpose 
register, the flags can be examined and modified using the 
bit manipulation instructions of the processor 10. 

When suspending a task, the processor 10 automatically 
saves the state of the EFLAGS register 66 in the task state 
segment (TSS) (not shown) for the task being suspended. When 
binding itself to a new task, the processor 10 loads the EFLAGS 
register 66 with data from the new task's program state register 
(PSS, not shown). 

When a call is made to an interrupt or an exception handler 
procedure, the processor 10 automatically saves the state of the 
EFLAGS register 66 on the procedure stack. When an interrupt or 
exception is handled with a task switch, the state of the EFLAGS 
register 66 is saved in the TSS for the task being suspended. 

The fundamental data types used in the processor 10 are 
bytes, words, doublewords, quadwords and double quadwords. A 
byte is eight bits, a word is two bytes (16 bits), a doubleword 
is four bytes (32 bits), a quadword is eight bytes (64 bits), 
and a double quadword is sixteen bytes (128 bits). 
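
For illustration, the data type widths listed above can be written down as C type definitions. This is only a sketch: the type names and the BITS_OF macro are hypothetical, and the 128-bit double quadword is modeled here as a pair of 64-bit halves because standard C has no 128-bit integer type.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  byte_t;        /* 8 bits  */
typedef uint16_t word_t;        /* 16 bits: two bytes  */
typedef uint32_t doubleword_t;  /* 32 bits: four bytes */
typedef uint64_t quadword_t;    /* 64 bits: eight bytes */

/* 128 bits: sixteen bytes, modeled as two quadwords (an assumption). */
typedef struct {
    quadword_t lo;
    quadword_t hi;
} double_quadword_t;

/* Width of a type in bits (uint8_t exists only where a byte is 8 bits). */
#define BITS_OF(type) (sizeof(type) * 8u)
```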

Referring to FIG. 3, the first n entries of the µIP stack 
100 are the in flight part of the µIP stack 100. In flight 
entries refer to entries currently being processed. The other 
entries are the retired part of the µIP stack 100. Retired entries 
are those that are no longer being processed. 

A µIP field 104 has the µIP pushed by a ms_push µOp 
(described below) and used by a fast_return µOp (described 
below) and has a width of 14 bits. 

A BackPtr field 106 points to a next entry in the µIP stack 
for µTOS to point to after an ms_return/ms_pop µOp. It has room 
for the pointer value and a wrap bit, so its width depends on 
stack size. 

When an in flight entry retires, the RetPtr field 102 is 
updated to point to the location in the retired stack (not 
shown) to which the entry is copied. Thus, its width depends on 
the stack size. 

An R0/R1 field 108 records whether this in flight entry has 
retired. Two bits are needed to handle wrap cases and thus its 
width is 2 bits. 
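
The entry layout described above may be pictured with the following C sketch. It is a model for discussion, not the hardware layout: the type and function names are hypothetical, the pointer widths are fixed at eight bits for simplicity although the text says they depend on stack size, and the two R bits are assumed to be indexed by the wrap bit, as the operations described later suggest.

```c
#include <assert.h>
#include <stdint.h>

#define UIP_BITS 14u
#define UIP_MASK ((1u << UIP_BITS) - 1u)

/* A table index plus the wrap bit that distinguishes reuse of a slot. */
typedef struct {
    uint8_t ptr;
    uint8_t wrap;   /* 0 or 1 */
} StackPtr;

typedef struct {
    uint16_t uip;       /* uIP field 104: only the low 14 bits are used */
    StackPtr back_ptr;  /* BackPtr field 106: entry underneath this one */
    uint8_t  ret_ptr;   /* RetPtr field 102: location of the retired copy */
    uint8_t  r[2];      /* R0/R1 field 108, one bit per wrap value */
} UipEntry;

/* Build a fresh in flight entry; both R bits start cleared. */
static UipEntry make_entry(uint16_t uip, StackPtr below)
{
    UipEntry e = {0};
    e.uip = (uint16_t)(uip & UIP_MASK);  /* keep 14 bits */
    e.back_ptr = below;
    return e;
}
```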

The µIP stack 100 includes four pointers that point to 
different entries in the µIP stack 100. The four pointers are a 
µTOS pointer 110, a µAlloc pointer 112, a NextRet pointer 114, 
and a µRetTOS pointer 116. The µTOS pointer 110, µAlloc pointer 
112, and NextRet pointer 114 require a wrap bit. 

The µTOS pointer 110 is the current top of stack 100 for 
µOp issue and points to one of the entries in the table or to a 
NULL entry. The µTOS pointer 110 is set to the current µAlloc 
pointer 112 on the issue of a ms_push µOp (described below). 
Note that it can point to any entry in the table (both in the in 
flight section and the retired section). 

The µAlloc pointer 112 points at the next entry to be 
allocated when an ms_push µOp (described below) is issued. The 
last entry this pointer can point to is n-1. After this point it 
wraps, so the entries from 0 to n-1 are treated as a circular 
queue. 

The NextRet pointer 114 points at the next entry to be 
deallocated from the µIP stack 100 when a µIP stack operation 
retires. Like the µAlloc pointer 112, this pointer wraps at n-1. 

The µRetTOS pointer 116 points at the retired top of stack 
100. This pointer can never point to entries 0 to n-1. 

Additional µOps are used with the µIP stack 100. The 
additional µOps are: ms_call, ms_push, ms_pop, ms_return, 
ms_tos_read, and ms_uip_stack_clear. Alternatively, call, 
return, and clear could be attached to other µOps. 

The ms_call µOp takes the next µIP, pushes it on the µIP 
stack 100, and uses the µIP in the immediate field as the target 
µIP of a jump. 

The ms_push µOp takes the value in the immediate field and 
pushes it on the µIP stack 100. 

The ms_pop µOp pops a value off the µIP stack 100 and 
replaces this µOp's immediate field with that value. 

The ms_return µOp pops a value off the µIP stack 100 and 
jumps to that µIP. 

The ms_tos_read µOp reads a value off the µIP stack 100 and 
replaces this µOp's immediate field with that value, without 
changing the contents of the µIP stack 100. 

The ms_uip_stack_clear µOp sets the µIP stack pointers to 
the reset values. Note that this µOp is executed at a time when 
all preceding stack operations and retirements are complete. 

Referring to FIG. 4, a micro-instruction pointer (µIP) 
stack process 200 includes executing (202) microcode (µcode) 
stored in an out-of-order µIP stack. The process 200 pushes 
(204) a next µIP on to the µIP stack and uses the µIP in an 
immediate field as a target µIP in a jump operation. The 
process 200 takes (206) a value of an immediate field of a 
microoperation (µOp) and pushes the value on to the µIP stack. 

The process 200 pops (208) a value off the µIP stack and 
replaces a current µOp immediate field with the value. The 
process 200 pops (210) a value off of the µIP stack and jumps to 
that value. 

The process 200 reads (212) a value off the µIP stack and 
replaces a µOp's immediate field with the value. The process 
200 sets (214) the µIP stack pointers to reset. 

The following terminology is used throughout the 
description below. MAX_INFLIGHT refers to the maximum number of 
calls allowed to be alive in the processor at once. MAX_STACK 
refers to the deepest function nesting level allowed. RET_OFFSET 
refers to the offset in the table 100 of the first entry in the 
retired area. NULL_INDEX refers to the index in the table 100 of 
the null entry. The code below assumes that this lies between 
the in flight section and the retired section of the stack 100. 

At reset: 

    µTOS.ptr = NULL_INDEX 
    µTOS.wrap = 0 
    µAlloc.ptr = 0 
    µAlloc.wrap = 0 
    NextRet.ptr = 0 
    NextRet.wrap = 0 
    µRetTOS = NULL_INDEX 
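
The four pointers and their reset values can be gathered into one C structure, as in the sketch below. The structure and function names are hypothetical; identifiers use a plain "u" in place of "µ"; and the values chosen for MAX_INFLIGHT and MAX_STACK are merely those of the worked example given later in this description.

```c
#include <assert.h>
#include <stdint.h>

enum {
    MAX_INFLIGHT = 3,               /* illustrative value only */
    MAX_STACK    = 4,               /* illustrative value only */
    NULL_INDEX   = MAX_INFLIGHT,    /* null entry sits between the two areas */
    RET_OFFSET   = NULL_INDEX + 1   /* first entry of the retired area */
};

typedef struct { uint8_t ptr, wrap; } StackPtr;

typedef struct {
    StackPtr utos;      /* uTOS pointer 110 */
    StackPtr ualloc;    /* uAlloc pointer 112 */
    StackPtr next_ret;  /* NextRet pointer 114 */
    uint8_t  uret_tos;  /* uRetTOS pointer 116 (no wrap bit) */
} UipPointers;

/* Apply the reset values listed above. */
static void uip_pointers_reset(UipPointers *p)
{
    p->utos     = (StackPtr){ NULL_INDEX, 0 };
    p->ualloc   = (StackPtr){ 0, 0 };
    p->next_ret = (StackPtr){ 0, 0 };
    p->uret_tos = NULL_INDEX;
}
```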



On issue of ms_call µOp: 

    if (µAlloc.ptr == NextRet.ptr && µAlloc.wrap != NextRet.wrap) 
        MSStall; 
    stack[µAlloc.ptr].BackPtr = µTOS; 
    stack[µAlloc.ptr].µip = current_µip + 1; 
    stack[µAlloc.ptr].R[µAlloc.wrap] = 0; 
    µTOS = µAlloc; //copies both the pointer and the wrap bit 
    µAlloc.ptr = (µAlloc.ptr + 1) % MAX_INFLIGHT; 
    if (µAlloc.ptr == 0) 
        µAlloc.wrap ^= 1; 
    next_µip = ms_call µip (immediate field) 



where, if the µAlloc pointer is equal to the NextRet pointer and 
their wrap bits are different, then the in flight table is full 
and one cannot issue any more call/push µOps until one retires. 
If the table is not full, then µAlloc.ptr points to the next 
entry to be allocated, so it is updated. More specifically, the 
current value of µTOS is placed into the BackPtr so we know 
where to return to. The µIP of the µOp after the call/push is 
put into the µip field. One of the R (retired) bits is cleared 
(which R bit depends on the current wrap bit of µAlloc). The 
µTOS is set to point to the current entry (µAlloc). Both the 
pointer and the wrap bit must be copied. µAlloc is incremented, 
wrapping (and toggling the wrap bit) if necessary. Finally, 
branch to the µIP in the immediate field of the µOp. 
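
The issue-time steps just described can be restated as a C sketch. This is one possible software model, not the hardware implementation: the type and function names are hypothetical, identifiers use "u" for "µ", the table sizes are those of the worked example later in this description, and a stall is modeled simply as a false return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MAX_INFLIGHT = 3, MAX_STACK = 4, NULL_INDEX = MAX_INFLIGHT,
       TABLE_SIZE = MAX_INFLIGHT + 1 + MAX_STACK };

typedef struct { uint8_t ptr, wrap; } StackPtr;
typedef struct { uint16_t uip; StackPtr back_ptr; uint8_t ret_ptr, r[2]; } UipEntry;
typedef struct {
    UipEntry entry[TABLE_SIZE];
    StackPtr utos, ualloc, next_ret;
    uint8_t  uret_tos;
} UipStack;

static void uip_reset(UipStack *s)
{
    *s = (UipStack){0};
    s->utos.ptr = NULL_INDEX;
    s->uret_tos = NULL_INDEX;
}

/* Returns false (stall) when the in flight table is full; otherwise
 * pushes the return uIP and stores the jump target in *next_uip. */
static bool issue_ms_call(UipStack *s, uint16_t current_uip,
                          uint16_t target_uip, uint16_t *next_uip)
{
    if (s->ualloc.ptr == s->next_ret.ptr && s->ualloc.wrap != s->next_ret.wrap)
        return false;                         /* MSStall */
    UipEntry *e = &s->entry[s->ualloc.ptr];
    e->back_ptr = s->utos;                    /* where to return to */
    e->uip = (uint16_t)(current_uip + 1);     /* uIP after the call */
    e->r[s->ualloc.wrap] = 0;                 /* not yet retired */
    s->utos = s->ualloc;                      /* pointer and wrap bit */
    s->ualloc.ptr = (uint8_t)((s->ualloc.ptr + 1) % MAX_INFLIGHT);
    if (s->ualloc.ptr == 0)
        s->ualloc.wrap ^= 1;                  /* toggle on wrap-around */
    *next_uip = target_uip;                   /* branch to the immediate */
    return true;
}
```

In this model, with MAX_INFLIGHT equal to three, a fourth call issued before any retirement finds µAlloc and NextRet equal with differing wrap bits and therefore stalls.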

On issue of ms_push µOp instruction, the same events as in 
a ms_call µOp occur, except that the µOp's immediate field is 
copied into the µip field of the stack instead of the µIP of the 
next µOp, and the next µIP to be sequenced is determined as 
usual. 

On issue of ms_return µOp instruction: 

    next_µip = stack[µTOS.ptr].µip; 
    back_ptr = stack[µTOS.ptr].BackPtr; 
    if (stack[back_ptr.ptr].R[back_ptr.wrap] == 1) 
        µTOS.ptr = stack[back_ptr.ptr].RetPtr; //wrap bit doesn't matter 
    else 
        µTOS = back_ptr; //copies both pointer and wrap bit 

where it gets the next µIP to sequence from the µip field of the 
stack entry pointed to by µTOS. Then pop the stack: the 
BackPtr of the entry pointed to by µTOS has the index of the 
entry underneath this one on the stack. However, if that entry 
has retired since the BackPtr was set, it may have been 
overwritten by another speculatively issued call. So check the 
R bit of the entry pointed to by BackPtr. If it is 0, then the 
BackPtr entry is valid and we set µTOS to point to it; if the R 
bit is 1, then the RetPtr field of that entry has its forwarding 
address. 
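
A C sketch of the return path, including the forwarding check on the R bit, is given below for illustration. As before, the type and function names are hypothetical, "u" stands for "µ" in identifiers, and the table sizes are those of the worked example later in this description.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_INFLIGHT = 3, MAX_STACK = 4, NULL_INDEX = MAX_INFLIGHT,
       TABLE_SIZE = MAX_INFLIGHT + 1 + MAX_STACK };

typedef struct { uint8_t ptr, wrap; } StackPtr;
typedef struct { uint16_t uip; StackPtr back_ptr; uint8_t ret_ptr, r[2]; } UipEntry;
typedef struct {
    UipEntry entry[TABLE_SIZE];
    StackPtr utos, ualloc, next_ret;
    uint8_t  uret_tos;
} UipStack;

/* Pop the entry at uTOS and return the uIP to sequence next. */
static uint16_t issue_ms_return(UipStack *s)
{
    const UipEntry *top = &s->entry[s->utos.ptr];
    uint16_t next_uip = top->uip;
    StackPtr back = top->back_ptr;

    if (s->entry[back.ptr].r[back.wrap] == 1)
        /* Entry below has retired: follow its forwarding address.
         * The wrap bit is irrelevant for retired entries. */
        s->utos.ptr = s->entry[back.ptr].ret_ptr;
    else
        s->utos = back;   /* still valid: copy pointer and wrap bit */
    return next_uip;
}
```

The first branch covers the case where the slot underneath may already have been reused by a later speculative call; the retired copy is then the only trustworthy location.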

On issue of the ms_pop µOp instruction the same events 
occur as for the ms_return µOp instruction, except the immediate 
field of the ms_pop µOp is replaced with the µip field from the 
stack entry pointed to by µTOS, and the next µIP is determined 
normally. 

On retirement of ms_call or ms_push µOp instruction: 

    old_µRetTOS = µRetTOS; 
    µRetTOS++; //no wrap needed; better not overflow! 
    stack[µRetTOS].BackPtr.ptr = old_µRetTOS; //wrap bit doesn't matter 
    stack[µRetTOS].µip = stack[NextRet.ptr].µip; 
    stack[NextRet.ptr].RetPtr = µRetTOS; 
    stack[NextRet.ptr].R[NextRet.wrap] = 1; 
    if (NextRet.ptr == µTOS.ptr) //wrap bits always the same 
        µTOS.ptr = µRetTOS; //wrap bit doesn't matter 
    NextRet.ptr = (NextRet.ptr + 1) % MAX_INFLIGHT; 
    if (NextRet.ptr == 0) 
        NextRet.wrap ^= 1; 
    clear any MSStall due to full in-flight stack; 



where µRetTOS is incremented, making sure it moves 
between NULL and the first entry correctly. The old value of 
µRetTOS is put in the BackPtr of the new retired entry. The µIP 
from the entry pointed to by NextRet (the next entry to retire) 
is copied to the µip field of the new retired entry. The RetPtr 
of the entry pointed to by NextRet is set to the new µRetTOS. 
The R bit of the entry pointed to by NextRet is set to 1. If the 
NextRet pointer equals the µTOS pointer, then we have just 
invalidated the entry pointed to by µTOS, so set µTOS to point 
to the retired copy (the new value of µRetTOS). Increment 
NextRet, wrapping and toggling the wrap bit if necessary. Clear 
the MS stall condition resulting from too many push/call 
operations in flight. 
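
The retirement steps for a call or push may be modeled in C as sketched below. The names are hypothetical and "u" stands for "µ". The overflow assertion is an addition of this sketch; the pseudocode only remarks that the retired area had better not overflow. Clearing of the stall condition is omitted because stalls are not modeled here.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_INFLIGHT = 3, MAX_STACK = 4, NULL_INDEX = MAX_INFLIGHT,
       TABLE_SIZE = MAX_INFLIGHT + 1 + MAX_STACK };

typedef struct { uint8_t ptr, wrap; } StackPtr;
typedef struct { uint16_t uip; StackPtr back_ptr; uint8_t ret_ptr, r[2]; } UipEntry;
typedef struct {
    UipEntry entry[TABLE_SIZE];
    StackPtr utos, ualloc, next_ret;
    uint8_t  uret_tos;
} UipStack;

/* Copy the oldest in flight entry into the retired area. */
static void retire_call_or_push(UipStack *s)
{
    uint8_t old_uret_tos = s->uret_tos;
    assert(s->uret_tos + 1 < TABLE_SIZE);    /* assumption: no overflow */
    s->uret_tos++;                           /* NULL_INDEX + 1 is the first slot */

    UipEntry *retired  = &s->entry[s->uret_tos];
    UipEntry *inflight = &s->entry[s->next_ret.ptr];
    retired->back_ptr.ptr = old_uret_tos;    /* wrap bit doesn't matter */
    retired->uip = inflight->uip;
    inflight->ret_ptr = s->uret_tos;         /* forwarding address */
    inflight->r[s->next_ret.wrap] = 1;       /* mark as retired */

    if (s->next_ret.ptr == s->utos.ptr)
        s->utos.ptr = s->uret_tos;           /* uTOS follows the retired copy */

    s->next_ret.ptr = (uint8_t)((s->next_ret.ptr + 1) % MAX_INFLIGHT);
    if (s->next_ret.ptr == 0)
        s->next_ret.wrap ^= 1;
}
```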



On retirement of ms_return or ms_pop µOp instruction: 

    µRetTOS--; 
    -OR- 
    µRetTOS = stack[µRetTOS].BackPtr; 

where the µRetTOS pointer is decremented, or replaced with the 
BackPtr from the entry it points to; these are equivalent. The 
BackPtr is implemented for the retired stack since it is used in 
the manipulation of µTOS (unless the rule is: if µTOS is within 
the retired stack, decrement; otherwise follow the BackPtr). 

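
The stated equivalence of the two alternatives can be checked with a small C sketch. It relies on the retired entries having been written in the manner described for retirement of a call or push, where each new retired entry's BackPtr is set to the previous µRetTOS. The names are hypothetical and only the BackPtr field is modeled.

```c
#include <assert.h>
#include <stdint.h>

enum { NULL_INDEX = 3, TABLE_SIZE = 8 };   /* illustrative sizes */

typedef struct { uint8_t back_ptr; } RetiredEntry;  /* only field used here */

/* Retire a call/push: advance uRetTOS and record the old value. */
static uint8_t retired_push(RetiredEntry *table, uint8_t uret_tos)
{
    uint8_t new_tos = (uint8_t)(uret_tos + 1);
    table[new_tos].back_ptr = uret_tos;
    return new_tos;
}

/* First alternative: decrement uRetTOS. */
static uint8_t retire_return_by_decrement(uint8_t uret_tos)
{
    return (uint8_t)(uret_tos - 1);
}

/* Second alternative: follow the BackPtr of the retired top entry. */
static uint8_t retire_return_by_backptr(const RetiredEntry *table,
                                        uint8_t uret_tos)
{
    return table[uret_tos].back_ptr;
}
```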
On mispredicted macrobranch/microbranch: 

    µAlloc = mispred_µAlloc; //copies both pointer and wrap bit 
    if (stack[mispred_µTOS.ptr].R[mispred_µTOS.wrap]) 
        µTOS.ptr = stack[mispred_µTOS.ptr].RetPtr; //wrap bit doesn't matter 
    else 
        µTOS = mispred_µTOS; //copies both pointer and wrap bit 

where the µAlloc and µTOS pointers are restored to the values 
that were saved when the mispredicted branch was 
issued. However, if the entry which the branch's µTOS points to 
has retired, set µTOS to point to its new location in the 
retired stack instead. 

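
Misprediction recovery can be sketched in C as a snapshot taken at branch issue and a restore that honors the R bit. The BranchSnapshot type and the function names are hypothetical; the description says only that the µAlloc and µTOS values are saved with the branch.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_INFLIGHT = 3, MAX_STACK = 4, NULL_INDEX = MAX_INFLIGHT,
       TABLE_SIZE = MAX_INFLIGHT + 1 + MAX_STACK };

typedef struct { uint8_t ptr, wrap; } StackPtr;
typedef struct { uint16_t uip; StackPtr back_ptr; uint8_t ret_ptr, r[2]; } UipEntry;
typedef struct {
    UipEntry entry[TABLE_SIZE];
    StackPtr utos, ualloc, next_ret;
    uint8_t  uret_tos;
} UipStack;

/* Values a branch carries with it from issue time. */
typedef struct { StackPtr utos, ualloc; } BranchSnapshot;

static BranchSnapshot snapshot_on_branch_issue(const UipStack *s)
{
    return (BranchSnapshot){ s->utos, s->ualloc };
}

/* Restore the pointers saved by the mispredicted branch. */
static void recover_on_mispredict(UipStack *s, BranchSnapshot snap)
{
    s->ualloc = snap.ualloc;                 /* pointer and wrap bit */
    const UipEntry *e = &s->entry[snap.utos.ptr];
    if (e->r[snap.utos.wrap])
        s->utos.ptr = e->ret_ptr;            /* entry retired: use new location */
    else
        s->utos = snap.utos;                 /* entry still in flight */
}
```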
On trap or fault: 

    µTOS.ptr = NULL_INDEX 
    µTOS.wrap = 0 
    µAlloc.ptr = 0 
    µAlloc.wrap = 0 
    NextRet.ptr = 0 
    NextRet.wrap = 0 
    µRetTOS = NULL_INDEX 

On assist: 

    µRetTOS++; 
    stack[µRetTOS].µip = assist µip 
    stack[µRetTOS].BackPtr = µRetTOS - 1; 
    µTOS = µRetTOS 
    µAlloc.ptr = 0 
    µAlloc.wrap = 0 
    NextRet.ptr = 0 
    NextRet.wrap = 0 

In the case of a trap, the µIP stack 100 can be completely 
cleared. By definition of trap, all the previous flows are 
complete, and all the new flows are speculative, so all values 
on the µIP stack are speculative and can be thrown away. 

There are two cases for a fault. If the fault will not 
return to the current flow of execution, the µIP stack 100 can 
be completely cleared. If the fault will return to the flow of 
execution, either the µIP stack 100 needs to be recovered or it 
needs to be cleared and a restriction placed on flows which can 
do this as to their use of ms_push/fast_return. 
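
The trap and assist cases can be modeled in C as sketched below. The names are hypothetical and "u" stands for "µ". The sketch covers only the simple fault case in which the stack is cleared; recovering the stack for a fault that returns to the current flow is not modeled. Setting the wrap bit of µTOS to zero on an assist is a choice of this sketch, since the wrap bit does not matter for retired entries.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_INFLIGHT = 3, MAX_STACK = 4, NULL_INDEX = MAX_INFLIGHT,
       TABLE_SIZE = MAX_INFLIGHT + 1 + MAX_STACK };

typedef struct { uint8_t ptr, wrap; } StackPtr;
typedef struct { uint16_t uip; StackPtr back_ptr; uint8_t ret_ptr, r[2]; } UipEntry;
typedef struct {
    UipEntry entry[TABLE_SIZE];
    StackPtr utos, ualloc, next_ret;
    uint8_t  uret_tos;
} UipStack;

/* Trap, or fault that does not return: discard everything. */
static void on_trap(UipStack *s)
{
    s->utos     = (StackPtr){ NULL_INDEX, 0 };
    s->ualloc   = (StackPtr){ 0, 0 };
    s->next_ret = (StackPtr){ 0, 0 };
    s->uret_tos = NULL_INDEX;
}

/* Assist: push the assist uIP directly on the retired stack and
 * discard all in flight entries. */
static void on_assist(UipStack *s, uint16_t assist_uip)
{
    s->uret_tos++;
    s->entry[s->uret_tos].uip = assist_uip;
    s->entry[s->uret_tos].back_ptr.ptr = (uint8_t)(s->uret_tos - 1);
    s->utos     = (StackPtr){ s->uret_tos, 0 };
    s->ualloc   = (StackPtr){ 0, 0 };
    s->next_ret = (StackPtr){ 0, 0 };
}
```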

The following example illustrates operation of the µIP 
stack 100. Consider, for example, the following sequence of 
events occurring in the µIP stack 100: 

(A) issue ms_call #1 from µIP X 

(B) issue ms_call #2 from µIP Y 

(C) issue µ_jump_cc #1 which will mispredict 

(D) issue ms_return from call #2 

(E) issue µ_jump_cc #2 which will mispredict 

(F) retire call #1 

(G) µ_jump_cc #2 executes and mispredicts 

(H) µ_jump_cc #1 executes and mispredicts 

(I) retire call #2 

Below is the µIP stack 100 as it will appear after each of 
these operations, assuming MAX_INFLIGHT=3 and MAX_STACK=4. The 
pointers are indicated on the right; the number to the right of 
the pointer is the wrap bit. 

Start: 

Entry | µIP | BackPtr | R0 | R1 | RetPtr | 
0     |     |         |    |    |        | <µAlloc-0 <NextRet-0 
1     |     |         |    |    |        | 
2     |     |         |    |    |        | 
NULL  | 0   | NULL    | 0  | 0  |        | <µRetTOS <µTOS 
4     |     |         | 0  | 0  |        | 
5     |     |         | 0  | 0  |        | 
6     |     |         | 0  | 0  |        | 
7     |     |         | 0  | 0  |        | 



After (A): Push X+1 onto the µIP stack 100, update µAlloc 
and µTOS pointers. 

After (B): Push Y+1 onto the µIP stack 100, update µAlloc 
and µTOS pointers. 

Entry | µIP | BackPtr | R0 | R1 | RetPtr | 
0     | X+1 | NULL    | 0  |    |        | <NextRet-0 
1     | Y+1 | 0-0     | 0  |    |        | <µTOS-0 
2     |     |         |    |    |        | <µAlloc-0 
NULL  | 0   | NULL    | 0  | 0  |        | <µRetTOS 
4     |     |         | 0  | 0  |        | 
5     |     |         | 0  | 0  |        | 
6     |     |         | 0  | 0  |        | 
7     |     |         | 0  | 0  |        | 


After (C): µ_jump_cc #1 issues, taking the values of 
µAlloc=2-0 and µTOS=1-0 with it. 

After (D): Next µIP is Y+1 (µip field of µTOS entry). Take 
BackPtr of µTOS entry (0-0): look up 
stack[BackPtr.ptr].R[BackPtr.wrap]: stack[0].R0 indicates this 
entry has not retired and is still valid, so the µTOS pointer 
gets 0-0. 

Entry | µIP | BackPtr | R0 | R1 | RetPtr | 
0     | X+1 | NULL    | 0  |    |        | <NextRet-0 <µTOS-0 
1     | Y+1 | 0-0     | 0  |    |        | 
2     |     |         |    |    |        | <µAlloc-0 
NULL  | 0   | NULL    | 0  | 0  |        | <µRetTOS 
4     |     |         | 0  | 0  |        | 
5     |     |         | 0  | 0  |        | 
6     |     |         | 0  | 0  |        | 
7     |     |         | 0  | 0  |        | 





After (E): µ_jump_cc #2 issues, taking the values of 
µAlloc=2-0 and µTOS=0-0 with it. 

After (F): Increment µRetTOS and copy NextRet entry to new 
µRetTOS entry. Set the RetPtr of the NextRet entry to point to 
its new location, and set the R bit. Increment NextRet. 

Entry | µIP | BackPtr | R0 | R1 | RetPtr | 
0     | X+1 | NULL    | 1  |    | 4      | <µTOS-0 
1     | Y+1 | 0-0     | 0  |    |        | <NextRet-0 
2     |     |         |    |    |        | <µAlloc-0 
NULL  | 0   | NULL    | 0  | 0  |        | 
4     | X+1 | NULL    | 0  | 0  |        | <µRetTOS 
5     |     |         | 0  | 0  |        | 
6     |     |         | 0  | 0  |        | 
7     |     |         | 0  | 0  |        | 

After (G): µ_jump_cc #2 mispredicts, returning µAlloc=2-0 
and µTOS=0-0. Set µAlloc to 2-0, no change. Check R bit of 
µTOS being restored: it is set, so set µTOS to the RetPtr of 
that entry instead. 

Entry | µIP | BackPtr | R0 | R1 | RetPtr | 
0     | X+1 | NULL    | 1  |    | 4      | 
1     | Y+1 | 0-0     | 0  |    |        | <NextRet-0 
2     |     |         |    |    |        | <µAlloc-0 
NULL  | 0   | NULL    | 0  | 0  |        | 
4     | X+1 | NULL    | 0  | 0  |        | <µRetTOS <µTOS-0 
5     |     |         | 0  | 0  |        | 
6     |     |         | 0  | 0  |        | 
7     |     |         | 0  | 0  |        | 



After (H): µ_jump_cc #1 mispredicts, returning µAlloc=2-0 
and µTOS=1-0. Set µAlloc to 2-0, no change. Check R bit of 
µTOS we're restoring: it is not set, so set µTOS to the value 
returned by the mispredict. 

Entry | µIP | BackPtr | R0 | R1 | RetPtr | 
0     | X+1 | NULL    | 1  |    | 4      | 
1     | Y+1 | 0-0     | 0  |    |        | <NextRet-0 <µTOS-0 
2     |     |         |    |    |        | <µAlloc-0 
NULL  | 0   | NULL    | 0  | 0  |        | 
4     | X+1 | NULL    | 0  | 0  |        | <µRetTOS 
5     |     |         | 0  | 0  |        | 
6     |     |         | 0  | 0  |        | 
7     |     |         | 0  | 0  |        | 



After (I): Increment µRetTOS and copy NextRet entry to new 
µRetTOS entry. Set the RetPtr of the NextRet entry to point to 
its new location, and set the R bit. Since NextRet==µTOS, we 
have just retired the last valid entry on the µIP stack 100, so 
set µTOS to point to the new location of the current entry on 
the retired stack. Increment NextRet. 

Entry | µIP | BackPtr | R0 | R1 | RetPtr | 
0     | X+1 | NULL    | 1  |    | 4      | 
1     | Y+1 | 0-0     | 1  |    | 5      | 
2     |     |         |    |    |        | <µAlloc-0 <NextRet-0 
NULL  | 0   | NULL    | 0  | 0  |        | 
4     | X+1 | NULL    | 0  | 0  |        | 
5     | Y+1 | 4       | 0  | 0  |        | <µRetTOS <µTOS-0 
6     |     |         | 0  | 0  |        | 
7     |     |         | 0  | 0  |        | 
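
The example sequence (A) through (I) can be replayed in software. The C sketch below is a model written for this purpose, not the hardware: all type and function names are hypothetical, "u" stands for "µ" in identifiers, the stall check is omitted because the table never fills in this sequence, and the µIP values passed for X and Y are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

enum { MAX_INFLIGHT = 3, MAX_STACK = 4, NULL_INDEX = MAX_INFLIGHT,
       TABLE_SIZE = MAX_INFLIGHT + 1 + MAX_STACK };

typedef struct { uint8_t ptr, wrap; } StackPtr;
typedef struct { uint16_t uip; StackPtr back_ptr; uint8_t ret_ptr, r[2]; } UipEntry;
typedef struct {
    UipEntry entry[TABLE_SIZE];
    StackPtr utos, ualloc, next_ret;
    uint8_t  uret_tos;
} UipStack;
typedef struct { StackPtr utos, ualloc; } BranchSnapshot;

/* Advance a circular in flight pointer, toggling wrap on wrap-around. */
static void bump(StackPtr *p)
{
    p->ptr = (uint8_t)((p->ptr + 1) % MAX_INFLIGHT);
    if (p->ptr == 0)
        p->wrap ^= 1;
}

static void issue_call(UipStack *s, uint16_t current_uip)
{
    UipEntry *e = &s->entry[s->ualloc.ptr];
    e->back_ptr = s->utos;
    e->uip = (uint16_t)(current_uip + 1);
    e->r[s->ualloc.wrap] = 0;
    s->utos = s->ualloc;
    bump(&s->ualloc);
}

static uint16_t issue_return(UipStack *s)
{
    uint16_t next_uip = s->entry[s->utos.ptr].uip;
    StackPtr back = s->entry[s->utos.ptr].back_ptr;
    if (s->entry[back.ptr].r[back.wrap])
        s->utos.ptr = s->entry[back.ptr].ret_ptr;
    else
        s->utos = back;
    return next_uip;
}

static void retire_call(UipStack *s)
{
    uint8_t old = s->uret_tos++;
    s->entry[s->uret_tos].back_ptr.ptr = old;
    s->entry[s->uret_tos].uip = s->entry[s->next_ret.ptr].uip;
    s->entry[s->next_ret.ptr].ret_ptr = s->uret_tos;
    s->entry[s->next_ret.ptr].r[s->next_ret.wrap] = 1;
    if (s->next_ret.ptr == s->utos.ptr)
        s->utos.ptr = s->uret_tos;
    bump(&s->next_ret);
}

static void recover(UipStack *s, BranchSnapshot b)
{
    s->ualloc = b.ualloc;
    if (s->entry[b.utos.ptr].r[b.utos.wrap])
        s->utos.ptr = s->entry[b.utos.ptr].ret_ptr;
    else
        s->utos = b.utos;
}

/* Replay steps (A) to (I) and return the final stack state. */
static UipStack replay_example(uint16_t x, uint16_t y)
{
    UipStack s = {0};
    s.utos.ptr = NULL_INDEX;                       /* reset state */
    s.uret_tos = NULL_INDEX;
    issue_call(&s, x);                             /* (A) */
    issue_call(&s, y);                             /* (B) */
    BranchSnapshot jump1 = { s.utos, s.ualloc };   /* (C) */
    uint16_t target = issue_return(&s);            /* (D) */
    assert(target == (uint16_t)(y + 1));
    BranchSnapshot jump2 = { s.utos, s.ualloc };   /* (E) */
    retire_call(&s);                               /* (F) */
    recover(&s, jump2);                            /* (G) */
    assert(s.utos.ptr == 4);
    recover(&s, jump1);                            /* (H) */
    assert(s.utos.ptr == 1);
    retire_call(&s);                               /* (I) */
    return s;
}
```

In this model the state after step (I) has both calls copied to retired entries 4 and 5, with µRetTOS and µTOS at entry 5 and both µAlloc and NextRet at entry 2.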





Several considerations can be made for debugging and design 
verification. For example, for patching considerations, the 
µRetTOS pointer can be readable and writeable through microcode. 
In addition, the retired instruction can be writeable through 
control register access. This allows microcode to clear the 
instruction from the in flight stack. The microcode can thus read 
the µRetTOS to determine the number of entries on the retired 
stack and pop the entries off the stack 100. Popping entries 
off the stack 100 takes the entries to the EXECUTIVE where the 
entries can be examined. The microcode can restore the µRetTOS 
(which puts the stack back to the state it was in before the 
pops), and modify the values in the µRetTOS via control register 
writes. 

The stack pointers µTOS, µAlloc, and NextRet should be 
visible for debugging. One way to make the stack pointers 
visible is to allow access through a control register. 

Access to the in flight µIP stack 100 can be through a 
control register mechanism, but an array dump mechanism is 
acceptable. 

Having control register access to the in flight µIP stack 
100 hardware may increase microcode flexibility, at the risk of 
making correctness extremely hard to maintain. 

A number of embodiments of the invention have been 
described. Nevertheless, it will be understood that various 
modifications may be made without departing from the spirit and 
scope of the invention. For example, an option is to provide a 
path from the TBPu to the MS where the EV_uIP can be accessed. 
This would allow the assisting µOp's µIP to be pushed 
on the µIP stack 100 and allow faster returns from assists. 
Alternatively, another µOp could be used to get the µIP from the 
EXEC to the MS for pushing on the MS stack. For longer assist 
flows, this could eliminate the indirect branch latency. 

Accordingly, other embodiments are within the scope of the 
following claims. 